{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 5 - Non-parametric distributions and bootstrap\n", "\n", "Last lab we looked at parametric distributions like the normal distribution and exponential distribution. Parametric distributions can be described by a (mathematical) function and their exact shape is determined by parameters (mean and standard deviation for the normal distribution; the rate $\\lambda$ for the exponential distribution).\n", "\n", "Today we will look at *non-parametric distributions* which either cannot be described by a mathematical function or the exact mathematical function is unknown.\n", "\n", "We will start with the restaurant inspection data from Assignment 1." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "\n", "%matplotlib inline\n", "\n", "# show all columns\n", "pd.set_option('display.max_columns', None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a dataframe with the restaurant inspection data, remembering to make the type of the inspection date column datetime." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question 3 of Assignment 1 asked if the mean inspection score in January was different from the mean inspection score in July. We are going to look at this question in more detail.\n", "\n", "First let's create a dataframe with only the January inspections." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many rows are in this new dataframe? Recall we can find this out with the `.shape` property." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the histogram of the January inspection scores." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does the histogram look like either of the parametric distributions?\n", "\n", "It has some characteristics of an exponential distribution, but in this lab we will treat this sample as its own non-parametric distribution.\n", "\n", "We want to know what range of values are likely for its mean. To find this, we will *re-sample* from the sample, meaning we will create new samples of the same size by *sampling with replacement* from the original sample. For each of the new samples, we will compute the mean." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# create an empty list\n", "sample_means_jan = []\n", "# loop to create 500 samples\n", "for i in range(500):\n", " # sample once from the SCORE column with replacement\n", " sample = jan_df[\"SCORE\"].sample(11463, replace = True)\n", " # compute the mean of the new sample\n", " sample_mean = sample.mean()\n", " # add the mean to the list\n", " sample_means_jan.append(sample_mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot a histogram of the sample means. Remember to convert `sample_means` into a Pandas Series first." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This histogram is approximating the *sampling distribution of the mean*.\n", "\n", "What parametric distribution does this histogram look like? Do you remember why?\n", "\n", "Let's find the range containing 95% of the sample means, which is also called the *95% confidence interval*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Find the 2.5 percentile of the means. Only 2.5% of the sample means are smaller than this number.\n", "pd.Series(sample_means_jan).quantile(0.025)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Find the 97.5 percentile of the means. 97.5% of the sample means are smaller than this number, \n", "# so only 2.5% of the sample means are larger than this number.\n", "pd.Series(sample_means_jan).quantile(0.975)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Therefore, the 95% confidence interval is approximately [20.0, 20.6]. The interpretation of a confidence interval is if we sample from a distribution and compute the 95% confidence interval, then 95% of the time this confidence interval will contain the true mean of the distribution.\n", "\n", "Now, compute the 95% confidence interval for the mean of the July scores.\n", "\n", "First, create a dataframe containing only the July inspections and find the number of rows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next create 500 bootstrap samples (samples of the same size with replacement) of the July inspection scores:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the histogram of the means:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the 95% confidence interval for the means of the bootstrap samples of the July scores:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's your 95% confidence interval?\n", "\n", "Does it overlap with your 95% confidence interval of the mean of the January scores?\n", "\n", "Since these two confidence intervals will contain the true means 95% of the time, if the intervals do not overlap, we can say that there is a statistically significant difference in the January and July score means.\n", "\n", "We can visually check this too by plotting the two histograms of the means on the same plot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you think this difference in means implies about the distributions of the January and July inspection scores? Do you think the scores have the same distribution?\n", "\n", "Let's plot the histograms of the distributions of the January and July scores on the same plot to visually compare:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall we can make the histograms transparent by adding the parameter `alpha = 0.5`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### If you are finished early:\n", "- Is there a statistically significant difference in the means of the January scores and the February scores?\n", "- In Lab 4, we fit a normal distribution to the babies' weights. If we take a sample of size 44 from that fitted normal distribution, how does its mean compare to the mean of the babies weights? Does this result tell us anything about whether the babies' weights come from a normal distribution? " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }